Chapter 10: Unsupervised Learning

Supervised Learning vs. Unsupervised Learning

If we fit a predictive model using a supervised learning technique, we can check our work by seeing how well the model predicts the response $Y$ on observations that were not used in fitting. In unsupervised learning, by contrast, there is no response variable, so there is no way to check our work: we do not know the true answer. This is what makes the problem unsupervised.

Principal Components Analysis (PCA)

When faced with a large set of correlated variables, principal components allow us to summarize the set with a smaller number of representative variables that collectively explain most of the variability in the original set.

Suppose that we wish to visualize $n$ observations with measurements on a set of $p$ features, $X_1, X_2, \ldots, X_p$, as part of an exploratory data analysis.

  1. Given an $n \times p$ data set $X$, the first principal component loading vector solves the optimization problem
\begin{align}\max_{\phi_{11}, \ldots, \phi_{p1}}\left\{ \frac{1}{n} \sum_{i=1}^{n} \left(\sum_{j=1}^{p} \phi_{j1} x_{ij}\right)^2\right\}, \quad \mbox{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.\end{align}

Letting $z_{i1} = \sum_{j=1}^{p} \phi_{j1} x_{ij}$, we can express this problem as follows,

\begin{align}\max_{\phi_{11}, \ldots, \phi_{p1}}\left\{ \frac{1}{n} \sum_{i=1}^{n} z_{i1}^2\right\}, \quad \mbox{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.\end{align}

Note that $\frac{1}{n}\sum_{i=1}^{n} z_{i1} = 0$ because $\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$ for each $j$ (the variables are centered), so the objective is simply the sample variance of the $z_{i1}$. The values $z_{11}, \ldots, z_{n1}$ are referred to as the scores of the first principal component.

  2. The second principal component is the linear combination of $X_1$, $X_2$, $\ldots$ , $X_p$ that has maximal variance out of all linear combinations that are uncorrelated with $Z_1$. The second principal component scores $z_{12}$, $z_{22}$, $\ldots$ , $z_{n2}$ take the form $$z_{i2} = \sum_{j=1}^{p} \phi_{j2}x_{ij}$$

where $\phi_{2} = \left[ \phi_{12}, \phi_{22}, \ldots, \phi_{p2}\right]$ is the second principal component loading vector.
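The optimization problem above can be sketched numerically: the loading vectors are the right singular vectors of the centered data matrix, and the scores for different components are uncorrelated. A minimal sketch with NumPy, using random stand-in data:

```python
import numpy as np

# A sketch of computing principal component loadings via the SVD,
# matching the optimization problem above. The data here are random
# stand-in values, not a real data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)          # center each column so the scores have mean zero

# The rows of Vt are the loading vectors phi_1, phi_2, ...
U, S, Vt = np.linalg.svd(X, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]

Z = X @ Vt.T                    # scores z_ik = sum_j phi_jk * x_ij

print(round(np.sum(phi1**2), 6))           # loadings satisfy the unit-norm constraint
print(round(abs(Z[:, 0] @ Z[:, 1]), 6))    # Z1 and Z2 are uncorrelated
```

The unit-norm constraint is what makes the maximization well-defined: without it, the objective could be made arbitrarily large by scaling the loadings.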

The USArrests dataset can be extracted from the ISLR package using the following R syntax.

library(ISLR)
write.csv(USArrests, "USArrests.csv")

We can also examine the variances of the four variables.
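A minimal sketch of this step with pandas; the file name `USArrests.csv` assumes the export above has been run, so a small hand-made stand-in frame with the same four columns is used here instead:

```python
import pandas as pd

# Stand-in values with the USArrests column names; with the real file,
# use: df = pd.read_csv("USArrests.csv", index_col=0)
df = pd.DataFrame({
    "Murder":   [13.2, 10.0, 8.1, 8.8],
    "Assault":  [236, 263, 294, 190],
    "UrbanPop": [58, 48, 80, 50],
    "Rape":     [21.2, 44.5, 31.0, 19.5],
})
print(df.var())   # Assault has a far larger variance than the other variables
```

The very different variances are the reason scaling matters: without it, the first principal component would be dominated by the highest-variance variable.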

We can scale the variables to have mean zero and standard deviation one. This can be done using sklearn's StandardScaler.
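A sketch of the scaling step on random stand-in data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Scale each variable to mean zero and standard deviation one.
# Random values stand in for the data here.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

X_scaled = StandardScaler().fit_transform(X)
print(np.round(X_scaled.mean(axis=0), 6))  # approximately 0 in each column
print(np.round(X_scaled.std(axis=0), 6))   # 1 in each column
```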

Now,

We use sklearn principal component analysis (PCA).
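A minimal sketch of fitting sklearn's PCA after standardizing; random data stands in for the four USArrests variables:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in data with four variables, standardized as above.
rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(50, 4)))

pca = PCA()
scores = pca.fit_transform(X)        # principal component scores z_ik
print(pca.components_.shape)         # (4, 4): one loading vector per row
print(scores.shape)                  # (50, 4): one score per observation/component
```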

We can obtain a summary of the proportion of variance explained (PVE) of the first few principal components.

We see that the first principal component explains 62.0% of the variance in the data, the second principal component explains 24.7%, and so forth. We can plot the PVE of each component, as well as the cumulative PVE, as follows:
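A sketch of the PVE and cumulative PVE plot; random data stands in here, so the PVE values will differ from the 62.0% and 24.7% reported above:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                # non-interactive backend for scripting
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random stand-in data; PCA's explained_variance_ratio_ gives the PVE.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
pca = PCA().fit(X - X.mean(axis=0))

pve = pca.explained_variance_ratio_
comps = range(1, len(pve) + 1)
fig, ax = plt.subplots()
ax.plot(comps, pve, marker="o", label="PVE")
ax.plot(comps, np.cumsum(pve), marker="o", label="Cumulative PVE")
ax.set_xlabel("Principal component")
ax.set_ylabel("Proportion of variance explained")
ax.legend()
fig.savefig("pve.png")
print(round(pve.sum(), 6))           # the PVEs sum to one
```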

K-Means Clustering

We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.

We now perform K-means clustering with K = 2.
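The simulated example described above can be sketched as follows; the cluster means and shift values are illustrative choices matching the description:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two true clusters: the first 25 observations get a mean shift
# relative to the next 25.
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
X[:25, 0] += 3
X[:25, 1] -= 4

km2 = KMeans(n_clusters=2, n_init=20, random_state=0).fit(X)
print(km2.labels_)                   # largely recovers the two true groups

# With K = 3, one of the true groups is split into two.
km3 = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(np.bincount(km3.labels_))      # sizes of the three clusters
```

Running the algorithm with a large `n_init` (multiple random starts) is recommended, since K-means converges only to a local optimum.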

We can also perform K-means clustering on this example with K = 3.

Hierarchical Clustering

In this section we use scipy.cluster.hierarchy.dendrogram to plot the hierarchical clustering as a dendrogram.
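A minimal sketch of the dendrogram plot, on the same kind of simulated two-group data as in the K-means example:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                # non-interactive backend for scripting
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Simulated two-group data, as in the K-means example above.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
X[:25] += [3, -4]

Z = linkage(X, method="complete")    # also try "average" and "single"
fig, ax = plt.subplots(figsize=(10, 4))
dendrogram(Z, ax=ax)
ax.set_title("Complete linkage")
fig.savefig("dendrogram.png")
print(Z.shape)                       # (n - 1, 4): one row per merge
```

The linkage matrix `Z` records each of the $n - 1$ merges; `dendrogram` turns it into the familiar tree plot.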

For this data, complete and average linkage generally separate the observations into their correct groups. However, single linkage identifies one point as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons.

In this section, we work on NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines.

This dataset can be extracted from the ISLR package using the following syntax.

library(ISLR)
write.csv(NCI60$data, "NCI60.csv")   # NCI60 is a list; $data holds the expression matrix

PCA on the NCI60 Data

We plot the projections of the NCI60 cancer cell lines onto the first three principal components (in other words, the scores for the first three principal components). On the whole, observations belonging to a single cancer type tend to lie near each other in this low-dimensional space. It would not have been possible to visualize the data without using a dimension reduction method such as PCA, since based on the full data set there are $\binom{6{,}830}{2}$ possible scatterplots, none of which would have been particularly informative.

We can also plot the PVE of the principal components of the NCI60 cancer cell line microarray data set.

Clustering the Observations of the NCI60 Data

We now proceed to hierarchically cluster the cell lines in the NCI60 data. To begin, we standardize the variables to have mean zero and standard deviation one. As mentioned earlier, this step is optional and should be performed only if we want each gene to be on the same scale.

We cluster the NCI60 cancer cell line microarray data with average, complete, and single linkage, using Euclidean distance as the dissimilarity measure. Complete and average linkage tend to yield evenly sized clusters, whereas single linkage tends to yield extended clusters to which single leaves are fused one by one.

How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?
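One way to answer this is to cross-tabulate the two sets of cluster labels. A sketch, using random stand-in data with the NCI60 dimensions (64 cell lines) since the expression matrix is not loaded here:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

# Random stand-in data: 64 "cell lines" with 30 stand-in features.
rng = np.random.default_rng(6)
X = rng.normal(size=(64, 30))

km = KMeans(n_clusters=4, n_init=20, random_state=0).fit(X)
# Cut the complete-linkage dendrogram to obtain four clusters.
hc = fcluster(linkage(X, method="complete"), t=4, criterion="maxclust")

# Cross-tabulate the two labelings; the clusters need not coincide.
print(pd.crosstab(km.labels_, hc, rownames=["K-means"], colnames=["Hierarchical"]))
```

Rows of the table are K-means clusters and columns are hierarchical clusters; large off-diagonal counts indicate where the two methods disagree.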

Observations per Hierarchical cluster

Hierarchy based on Principal Components 1 to 5


References

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, pp. 3-7). New York: Springer.
  2. Jordi Warmenhoven, ISLR-python
  3. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R.